Abstract
Thank you for looking at the report for my simulation code sample in
R! This report covers a project inspired by a question I came
across while completing a questionnaire for the World Bank’s Development
Impact Evaluation (DIME). The question was along the lines of:
“There is a program that is implemented at the village level.
Households within the same village are very similar but households
between villages are not. To maximize the likelihood of detecting the
program’s effect, is it better to sample more households within each
village or to sample more villages?”
In this report I will answer this question through the use of
simulation techniques in which programs with different effect sizes are
implemented on a sample of randomly selected villages. I will also go
over some of the theory and intuition for the answer and take this
opportunity to talk about some sampling techniques, mainly cluster
sampling vs. stratified sampling vs. systematic sampling. However, the
main objective of this report is to demonstrate my skills in
R; as such, this document will mainly focus on the
code itself.
For any given problem there are many different solutions and paths, and some paths may be more efficient or shorter than others. There are two necessary conditions for good code. Good code must be functional and readable.
One cannot come at the expense of the other. These principles are behind every decision and form the backbone of every script in every language. I am confident you will see this reflected in my work.
One of the most important ways to create good code is to follow good coding practices from the beginning, as going back to fix things will always be costlier than starting out the right way. Throughout this report you will see blue text boxes that explain some of the styling decisions made in the script to guarantee its functionality and readability.
This is an example of a style textbox.
Formal writing and academic writing disclaimer: the first-person plural and active voice are used to engage with the reader.
Making sure your code is reproducible and portable is also essential
for good code. I always create a new R project for each assignment,
maintain R environments through renv, and keep
detailed records of every change through version control (as you can
probably tell by reading this document on GitHub). In fact, this project
not only uses renv to increase its reproducibility and
portability, it also contains a mamba directory with the
.Rprofile and config.yml files needed to
guarantee that, no matter when or on what system, this project is
100% reproducible and portable. Just remember not to use
mamba and renv at the same time, as they can conflict with each
other.
This is a question I came across while completing a questionnaire for the World Bank’s Development Impact Evaluation (DIME). While not a verbatim quote, the question was:
There is a program that is implemented at the village level. Households within the same village are very similar but households between villages are not. To maximize the likelihood of detecting the program’s effect, is it better to sample more households within each village or to sample more villages?
Intuitively, one may think that it is better to sample more villages. If households within each village are similar, then an additional household from a village that has already been sampled contributes less information to the regression’s power than a household from an unsampled village, which is therefore different from all the other households in the sample.
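To put a number on this intuition, here is a minimal sketch based on the standard design effect for equal-sized clusters, deff = 1 + (m - 1) * icc, where m is the number of households per village and icc is the intra-cluster correlation. It is not part of the project's scripts: the function name `effective_n` and the icc value of 0.8 are assumptions chosen only for illustration.

```r
# Illustrative sketch (not the project's code): effective sample size under
# the standard design effect for clusters of equal size.
effective_n <- function(n_villages, n_households, icc) {
  deff <- 1 + (n_households - 1) * icc   # design effect
  n_villages * n_households / deff       # total sample size deflated by deff
}

# Two designs with the same 400 households and an assumed icc of 0.8
few_villages  <- effective_n(n_villages = 10, n_households = 40, icc = 0.8)
many_villages <- effective_n(n_villages = 40, n_households = 10, icc = 0.8)

few_villages    # about 12.4 effective observations
many_villages   # about 48.8 effective observations
```

Under this assumption, spreading the same 400 households over more villages yields roughly four times as many effective observations, which is the intuition described above.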
As mentioned above, intuitively one might expect that sampling households from different villages would increase the statistical significance of the estimator. Let’s take a look at the first graph. It is worth noting that all these graphs are interactive, so you may pan, rotate, zoom, etc., as well as hover over the plot to see the number of villages in each treatment group (i.e. treated and control), the total sample size, and the p-value, with a ’*’ at each of the usual significance thresholds (10%, 5%, 1%).
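As a small illustration of the hover labels described above, the following sketch maps p-values to significance stars. The helper name `p_stars` is hypothetical, and I am assuming the common convention of one star per threshold crossed; the project's actual labelling code may differ.

```r
# Illustrative helper (hypothetical name): one star per threshold crossed,
# using the usual 10%, 5% and 1% significance levels.
p_stars <- function(p) {
  ifelse(p < 0.01, "***",
         ifelse(p < 0.05, "**",
                ifelse(p < 0.10, "*", "")))
}

# Build hover-style labels for a few example p-values
p_values <- c(0.005, 0.03, 0.08, 0.5)
labels <- paste0(format(p_values, nsmall = 3), p_stars(p_values))
```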
From this graph it is not immediately obvious that either sampling strategy is better than the other. In fact, the surface seems to descend at the same rate regardless of whether you increase the number of households per village or the number of villages per treatment group. Additionally, the surface of this graph is very rough, with many local maxima and minima scattered throughout. This is, of course, expected as a result of idiosyncratic errors. However, it is nevertheless surprising, as this graph shows the average p-value over 1,000 runs.
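For readers who want to see how a single point on such a surface could be produced, here is a minimal sketch of averaging p-values over repeated simulated samples. It is an illustration under my own assumptions, not the script behind the graphs: the function names, the standard deviations, and the choice to regress village averages on a treatment dummy are all hypothetical.

```r
# Illustrative sketch (not the project's script). All names and default
# values are assumptions made for this example.
simulate_p_value <- function(n_villages, n_households, effect,
                             sd_village = 1, sd_household = 0.2) {
  n_clusters <- 2 * n_villages                  # treated + control villages
  # Villages are exchangeable here, so a fixed split stands in for
  # random assignment of treatment at the village level.
  treated <- rep(c(1, 0), each = n_villages)
  village_mean <- effect * treated + rnorm(n_clusters, sd = sd_village)
  village_id <- rep(seq_len(n_clusters), each = n_households)
  # Households are similar within a village: small household-level noise
  outcome <- village_mean[village_id] +
    rnorm(length(village_id), sd = sd_household)
  # Collapse to village averages so the test respects the clustering
  village_avg <- as.vector(tapply(outcome, village_id, mean))
  fit <- lm(village_avg ~ treated)
  summary(fit)$coefficients["treated", "Pr(>|t|)"]
}

# Average p-value over repeated runs for one design
mean_p_value <- function(n_villages, n_households, effect, n_runs = 1000) {
  mean(replicate(n_runs, simulate_p_value(n_villages, n_households, effect)))
}

set.seed(1)
# Same 200 households per treatment group, allocated two different ways
p_more_households <- mean_p_value(n_villages = 5, n_households = 40,
                                  effect = 1, n_runs = 200)
p_more_villages <- mean_p_value(n_villages = 20, n_households = 10,
                                effect = 1, n_runs = 200)
```

Evaluating `mean_p_value` over a grid of `n_villages` and `n_households` values would give one surface per effect size. Fewer runs are used here than the 1,000 behind the graphs, only to keep the example quick.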
Let’s now see how this graph changes as we increase the effect size. Once again I encourage you to explore each graph.